Abstract


PLATINUM is an operating system kernel with a novel mem- 
ory management system for Non-Uniform Memory Access 
(NUMA) multiprocessor architectures. This memory man- 
agement system implements a coherent memory abstraction. 
Coherent memory is uniformly accessible from all processors 
in the system. When used by applications coded with ap- 
propriate programming styles it appears to be nearly as fast 
as local physical memory and it reduces memory contention. 
Coherent memory makes programming NUMA multiproces- 
sors easier for the user while attaining a level of performance 
comparable with hand-tuned programs. 

This paper describes the design and implementation of 
the PLATINUM memory management system, emphasizing 
the coherent memory. We measure the cost of basic opera- 
tions implementing the coherent memory. We also measure 
the performance of a set of application programs running 
on PLATINUM. Finally, we comment on the interaction be- 
tween architecture and the coherent memory system. 

PLATINUM currently runs on the BBN Butterfly Plus™ 
Multiprocessor. 


1 The Need for Transparent Manage- 
ment of Non-Uniform Memory 


PLATINUM is an operating system kernel designed to be a 
platform for research on memory management systems for 
Non-Uniform Memory Access (NUMA) multiprocessor ar- 
chitectures, those in which the distributed, shared memory 
of the machine can be referenced by any processor on the 
machine, but the cost of accessing a particular physical loca- 
tion varies with the distance between the processor and the 
memory module. The name “PLATINUM” is an acronym 
for “Platform for Investigating Non-Uniform Memory”. Its 
purpose is the experimental evaluation of a software imple- 
mentation of a coherent memory abstraction on top of non-
uniform access physical memory architectures. PLATINUM
runs on BBN Butterfly Plus™ Parallel Processors. 

One can achieve impressive speedup due to parallelism on 
a NUMA multiprocessor, but unfortunately this can entail 
a considerable effort. Because remote memory references 
are an order of magnitude more expensive than local refer- 
ences and because remote references are subject to several 
forms of potential contention, the physical location of data 
is critical to performance. On the BBN Butterfly™ Par-
allel Processor, a popular and productive way to deal with
the problem of shared data location is to avoid the ques- 
tion by using libraries [20] and languages [26] that support 
message passing. When using a non-uniform access memory 
directly, however, one has to deal with data locality. This 
programming of data locality is reminiscent of the explicit 
management of memory hierarchies using overlays: attain- 
ing performance can be non-intuitive and can depend upon 
dynamic properties of program execution; worse, it has to 
be done explicitly by every application programmer. The 
importance of this tuning is such that a programmer can 
expend far more effort on “programming the memory archi- 
tecture” than in solving the original application problem. 

Our goal is to explore the possibility of achieving per- 
formance comparable to that of hand-tuned programs with 
a simple, easy-to-program shared-memory model. It is our 
hypothesis that it is crucial to present users with a simple 
model of shared memory implemented so as to attain good 
parallel performance on applications written in a natural 
programming style. The coherent memory model imple- 
mented by PLATINUM is an exercise in doing this trans- 
parently in an operating system kernel on top of an exist- 
ing NUMA multiprocessor. Because we wish to explore the 
limits of this approach, PLATINUM assumes neither special 
architectural support nor extensive language-specific assis- 
tance from a compiler. We do believe that these are vital 
in the long run and exploring these issues is a part of our 
long-term research interests. 

NUMA multiprocessor organization leads to memory 
management design choices that differ markedly from those 
that are common in systems designed for uniprocessors or 
UMA multiprocessors. If two or more processes on a unipro- 
cessor are sharing read-only data such as a common code 
segment, it is wasteful to allocate multiple private copies. 
Such replication is expensive in terms of the number of phys- 
ical pages used and in terms of the expense of copying the 
data. For example, to reduce this expense, the implemen- 
tation of Mach [23] minimizes the amount of data-copying 
and replication through the use of copy-on-write and other
techniques.

In contrast, extra data motion in the form of replication 
and migration can yield greatly improved performance on 
a NUMA machine. Placing data in the local memory of a 
processor that is using it decreases memory access latency. 
More importantly, a processor accessing local data is not 
performing remote operations that contend for remote mem- 
ory modules and for the processor-memory switch. These 
two factors also motivate the use of caches in bus-based mul- 
tiprocessors [13]. The advantages of replication and data 
motion distinguish the problem of managing memory on a 
NUMA machine from the same problem on uniprocessors 
and Uniform Memory Access (UMA) multiprocessors. 

PLATINUM’s implementation of coherent memory repli- 
cates and migrates data to the processors using it, thus cre- 
ating the appearance that memory is uniformly and rapidly 
accessible. The protocol for controlling this data movement 
is derived by extending a directory-based cache coherency 
algorithm using selective invalidation [7, 2]. The extension 
exploits the NUMA architecture by adding the option of 
using the remote memory access mechanism rather than 
replicating or migrating data to local memory on an ac- 
cess miss. Using remote memory access effectively disables 
caching on a block-by-block basis. This is crucial when 
write-shared data is modified at fine temporal and spatial 
granularities because the overhead of executing a coherency 
protocol can be more expensive than not caching. With the 
large block sizes and overheads associated with software- 
assisted caching, the effect can be especially bad. This is 
a critical distinction between NUMA memory management 
in PLATINUM and the software caching of Li’s Distributed 
Virtual Memory [22] or the software-controlled caching of 
the VMP Multiprocessor [9, 8]. 

Because the measured performance of real applications 
is a far better indicator of the success of a system than 
analytic predictions, simulations, or simplified experiments, 
PLATINUM provides enough of a general-purpose applica- 
tion environment to support such programs. We are actively 
building a library of applications designed to test the perfor- 
mance of PLATINUM with a variety of programming styles 
that use different memory access patterns. The results are 
encouraging. Figure 1 plots the speedup of a program that 
simulates Gaussian elimination without pivoting on dense 
matrices. In this case the input is 800 by 800. This par- 
ticular problem was chosen because it was used in perfor- 
mance studies of programming systems [10, 18] on earlier 
versions of the Butterfly. It simulates Gaussian elimination 
in the sense that it uses integer rather than floating-point 
operations, thus emphasizing the relative impact of memory 
performance with respect to the speed of arithmetic opera- 
tions. 

The design of PLATINUM targets factors such as ease of 
programming and performance, since these are the primary 
criteria by which the coherent memory abstractions should 
be judged. While other issues such as security, protection, 
and long-term storage have been considered in the abstract 
design, they have received only cursory attention in the cur- 
rent version. 


1.1. PLATINUM Programming Model 


Since our goal is the exploration of transparent NUMA
memory management, we use familiar abstractions and in-
terfaces as much as possible. This decision determined the
interface presented to the user and some of the internal ker-
nel interfaces. We used the Mach [23] model of memory as
the prototype because of its modularization into machine-
dependent and -independent parts. Within a stripped-down
version of this model, PLATINUM coherent memory is im-
plemented as a replacement for the machine-dependent part
of the memory management system.

Figure 1: Gaussian Elimination (Speedup vs. Processors)

PLATINUM exports to user programs an abstract mul- 
tiprocessor model in which all primary memory accessible 
to user programs appears to be a fast (on average) shared 
physical memory module uniformly accessible from all of 
the processors in the system. The physical location of data 
in primary memory is hidden from the user. PLATINUM 
allocates memory in page-aligned regions. Page boundaries 
are not hidden, enabling the user to reduce interprocessor 
interference by allocating shared data with different access 
patterns to distinct pages. 

The fundamental abstractions supported by PLATINUM 
are the thread, the memory object, the port, and the address 
space. These objects all appear in a single flat global name 
space. 

A memory object is an abstraction of an ordered list of 
memory pages. A range of pages within a memory object 
may be bound to any contiguous page-aligned virtual ad- 
dress range of the same size. Neither the virtual address 
range nor the access rights need be the same in every ad- 
dress space. Since they have global names, memory objects 
are the natural unit of data- or code-sharing between ad- 
dress spaces. 

A thread is a kernel-scheduled thread of control. At any 
time it is bound to a single processor. An explicit migration 
operation can move it to another location. It is, however, 
constrained to execute within a single address space. 

An address space is a list of bindings of memory objects 
and access rights to virtual address ranges. It defines the
environment in which one or more threads may execute.
The threads in a single address space may be distributed to 
multiple processors. 

A port is a message queue that can have any number of 
senders and receivers. Messages are variable-length arrays
of zero or more bytes. Globally named, ports provide a com- 
munication medium usable by threads that do not share a 
common memory object. They also provide blocking syn- 
chronization. 

Logical concurrency is realized through the use of mul- 
tiple threads to implement a single application. True par- 
allelism is realized by running those threads on multiple 
processors. Communication between threads can be based 
on either shared memory or message-passing via ports. 
Threads that coexist within a single address space share all 
of the memory objects mapped into that address space. This 
implies, in addition to data coherency, that these threads 
share a coherent view of the mappings of memory objects 
that constitute the shared space. A more restricted form of 
sharing is realized by mapping a memory object into multi- 
ple address spaces. The shared object can be accessed by all 
of the threads in those spaces, but the non-shared objects 
in each address space are protected from threads in other 
spaces. 

A comprehensive description of the interface can be found 
in [11]. Given the initial successes with PLATINUM, its in- 
terfaces are being extended as required to provide added 
functionality and ease of programming to support larger ex- 
periments. We are also adding an instrumentation interface 
to the kernel to help interpret its behavior. The design is 
intended to make it easy to integrate PLATINUM coherent 
memory with Mach. 


2 Organization and Implementation of 
the Memory Management System 


A typical virtual memory system has traditionally managed 
a memory hierarchy consisting of a cache, a uniformly acces- 
sible primary memory, and a significantly slower secondary 
memory. The existence of remote primary memory on a 
NUMA multiprocessor adds at least one more level to this 
hierarchy. PLATINUM memory management is structured 
to separate the traditional responsibilities of virtual mem- 
ory management from the additional requirements imposed 
by the NUMA architecture. The memory management sys- 
tem is constructed in three layers. The highest layer is the 
Virtual Memory system. The middle layer is the Coher- 
ent Memory system. The lowest layer is the Physical Map 
system. 


2.1 Organization 


The virtual memory system manages the mappings from 
virtual address ranges to memory objects and from memory 
objects to coherent pages (see the left side of Figure 2). The 
machine-independent part of Mach memory management is 
the prototype for this layer. 

The coherent memory system is responsible for the map-
pings from coherent pages to physical pages. These may be
one-to-many. The left side of Figure 2 shows coherent-to-
physical mappings for one of the three memory objects.

The coherent memory system also guarantees the consis- 
tency of the physical pages backing a coherent page. This is 
implemented by extending a directory-based protocol that 
uses selective invalidation to maintain coherency [7]. For 
each coherent page the system maintains a directory of the 
set of physical pages backing it. A new physical page is 
added to the set when the system chooses to replicate the 
coherent page. The replication policy makes the decision 
between the replication of a coherent page and the creation 
of a mapping to an existing physical page. When a pro- 
cessor writes to a replicated coherent page, all but a single 
physical copy are invalidated and removed from the set. 

The implementation of the protocol makes heavy use of
the hardware memory management unit (MMU), which on
the Butterfly Plus is a Motorola MC68851. Access rights to phys-
ical pages are potentially more restrictive than those spec- 
ified by the virtual memory system in order to ensure the 
generation of traps by memory accesses which require ac- 
tion. Most transitions in the protocol are thus initiated by 
address translation and protection faults, and are performed 
by the page fault handler. 

The physical map system is a simple machine-dependent 
page table and address translation cache management mod- 
ule. For each address space a physical page map (Pmap) 
is used to cache the compositions of the logical mappings 
maintained by the virtual and coherent memory systems. 
Each physical mapping illustrated on the right side of Fig- 
ure 2 is the composition of a corresponding sequence of map- 
pings on the left side of the figure. 
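The composition that a Pmap caches can be pictured with a minimal sketch. The table layouts and the names `vm_map`, `object_pages`, `cpage_copies`, and `compose_translation` are illustrative assumptions rather than PLATINUM's actual structures, and the sketch assumes one memory module per processor so that a module index doubles as a processor index.

```python
# Hypothetical sketch: a Pmap entry is the composition of three logical mappings.
# All names and table layouts are illustrative assumptions.

def compose_translation(vpage, processor, vm_map, object_pages, cpage_copies):
    """Return the physical page `processor` would cache for `vpage`, or None."""
    binding = vm_map.get(vpage)              # virtual page -> (memory object, page index)
    if binding is None:
        return None
    cpage = object_pages.get(binding)        # (object, index) -> coherent page
    if cpage is None:
        return None
    copies = cpage_copies.get(cpage, {})     # coherent page -> {module: physical page}
    if processor in copies:                  # prefer the copy in the local module
        return copies[processor]
    return next(iter(copies.values()), None) # otherwise use some existing remote copy


vm_map = {0x10: ("objA", 0)}
object_pages = {("objA", 0): "cpage7"}
cpage_copies = {"cpage7": {0: ("module0", 42), 3: ("module3", 5)}}

# Each processor keeps its own Pmap, so the same virtual page can resolve differently.
pmap_proc3 = {0x10: compose_translation(0x10, 3, vm_map, object_pages, cpage_copies)}
pmap_proc1 = {0x10: compose_translation(0x10, 1, vm_map, object_pages, cpage_copies)}
```

Processor 3 resolves the page to its local copy, while processor 1, which has no local copy, resolves it to a remote one.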


2.2 Implementation Strategy 


The promise of high performance, scalable parallelism us- 
ing a shared-memory model of computation makes NUMA 
multiprocessor architectures interesting. It is therefore vital 
that an operating system kernel be very efficient and avoid 
limiting the scalability of the system. The memory manage- 
ment system is implemented with this in mind. Kernel oper- 
ations and data structures are decentralized to provide max- 
imum concurrency. Wherever possible, atomic memory op- 
erations are used to implement concurrent data structures. 
When an explicit lock is needed, the scope over which it is 
held is kept small to reduce the residual impact of contention 
between concurrent kernel operations. Remote memory ac- 
cesses in critical sections are avoided, especially within the 
coherent page fault handler. In some cases the algorithms 
and data structures use several local memory accesses to 
avoid a single remote memory access. 

The kernel address space consists of two regions, one in 
physical memory and the other in coherent memory. Ker- 
nel code and the data structures for the lowest kernel layers 
are in physical memory. These structures implement phys- 
ical and coherent memory systems as well as physical de- 
vice handlers. The kernel replicates its code and read-only 
data. Since writable data in physical memory can only have 
one copy, each writable page in kernel physical memory is 
mapped for remote access by all but its local processor. 

The layers of the kernel that implement virtual memory, 
threads, and ports keep their data structures in the coherent 
memory region. Because they are in the coherent memory 
region, kernel stacks for threads require special handling. 
Otherwise, the first fault after a thread has moved would 
try to save the processor state on the kernel stack for the
thread, generating a coherent memory fault. This circular
dependence is broken by explicitly moving the kernel stack
with the thread.

Figure 2: Logical and physical page mappings in PLATINUM. The coherent-to-physical mappings for only one of the
memory objects are shown.


2.3 Implementation Structure


The coherent memory system consists of two modules: 


1. The coherent map (Cmap) system is responsible for 
maintaining the coherency of the mappings from vir- 
tual to physical pages for each processor. The interface 
provided by the Cmap to the virtual memory system 
is similar to the Mach pmap interface. 


2. The coherent page (Cpage) system is responsible for 
allocating and freeing coherent pages as well as the
physical pages that back them. It also maintains their
coherency and implements a replication policy. The 
Cpage system includes the page fault handler and a 
defrost daemon. 


The coherent and physical memory management systems 
use the following data structures (see Figure 3): 


• The mappings from virtual addresses to memory ob-
jects and from memory objects to coherent pages are
kept by the virtual memory system. For each address
space the coherent memory system caches the compo-
sition of these mappings in a Cmap. A Cmap contains
a table of virtual-to-coherent page mappings (Cmap
entries), a queue of Cmap messages describing recent
changes to the address space, a bit mask denoting pro-
cessors with this address space active, and a separate
local page table (Pmap) for each of these processors.


• A Cmap entry is analogous to a page table entry. It
contains a pointer to the coherent page, an access rights
field, and a bit vector called the reference mask. If a
processor has a virtual-to-physical translation for the
coherent page in its Pmap, the bit corresponding to
that processor is set.


• A Cmap message describes a change made to a virtual
address space that affects virtual-to-physical mappings
held by two or more processors. It contains a virtual
address and a directive either to invalidate the current
translation or to restrict the access rights in it. Each
processor is responsible for making these changes be-
fore running any thread in that address space.


• The Cpage table is the list of all coherent pages. Each
entry in the Cpage table describes the state of a Cpage.
This information includes a directory of physical pages
backing the Cpage and indicates whether there is a
virtual-to-physical translation allowing write access.
The directory consists of a bit mask indicating which
memory modules contain a physical page backing the
Cpage and a list of these physical pages. An entry also
records the time of the most recent invalidation and
whether the Cpage has been frozen by the replication
policy.


• Each memory module contains an inverted page table
describing the state of each physical page in the mod-
ule. An entry indicates whether the physical page is
allocated and to which coherent page.


Figure 3: The data structures involved in the management
of a Cpage mapped into two address spaces. The Cmap
message queues are not shown.
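As a rough illustration, the Cmap entry and Cpage table entry described above might be modeled as follows. The field names, types, and the two helper functions are assumptions chosen for readability; the text specifies only what information each structure holds, not its layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CmapEntry:
    cpage: int               # index of the coherent page in the Cpage table
    access: str              # access rights granted by the virtual memory system
    reference_mask: int = 0  # bit p set: processor p holds a translation in its Pmap


@dataclass
class CpageEntry:
    module_mask: int = 0     # bit m set: memory module m holds a backing page
    physical_pages: List[Tuple[int, int]] = field(default_factory=list)
    writable_mapping: bool = False             # some translation allows write access
    last_invalidation: Optional[float] = None  # time of the most recent invalidation
    frozen: bool = False                       # set by the replication policy


def note_translation(entry: CmapEntry, processor: int) -> None:
    """Record that `processor` now caches a translation for this entry."""
    entry.reference_mask |= 1 << processor


def add_backing_page(cpage: CpageEntry, module: int, page: int) -> None:
    """Add a physical page in `module` to the directory of `cpage`."""
    cpage.module_mask |= 1 << module
    cpage.physical_pages.append((module, page))


entry = CmapEntry(cpage=7, access="rw")
note_translation(entry, 2)
directory = CpageEntry()
add_backing_page(directory, module=2, page=40)
```

Keeping both a bit mask and a list in the directory mirrors the description: the mask answers "is there a copy in this module?" with a single test, while the list is needed only when the copies themselves must be visited.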


3 Shared-Memory Coherency 


The shared-memory coherency problem has two major 
facets, data coherency and address space coherency. Much 
of the literature on coherent caches for multiprocessors con- 
cerns the data coherency part of the problem. On UMA 
multiprocessors with coherent caches the address space co- 
herency problem is primarily a matter of maintaining the 
consistency of address translation caches [5]. Given the lack 
of any direct hardware support for either form of coherency, 
PLATINUM solves both aspects of the problem in one unified 
framework; data coherency and address space coherency are 
implemented using a mechanism for invalidating or restrict- 
ing the access granted by a mapping. 


3.1 A NUMA Multiprocessor Shootdown
Mechanism


When an address space is modified by the addition of new 
mappings or by relaxing the protection on a range of virtual 
addresses, it is easy to distribute the changes. Any proces- 
sor attempting to use its expanded privilege will cause a bus 
error and thus be able to discover and react to the change. 
On the other hand, when an address space is restricted by 
removing mappings or restricting access rights, some addi- 
tional mechanism is necessary to ensure consistency. For ex- 
ample, consider a UMA multiprocessor with a single shared 
page table per address space. Since page table entries are 
cached in the address translation cache (ATC) of each pro- 
cessor’s hardware memory management unit, these cached 
copies must be invalidated whenever the corresponding page 
table entry is invalidated or restricted. Because address 
translation caches are usually private to the processor to
which the MMU is attached, multiprocessor operating sys-
tems such as Mach use a software shootdown mechanism to
implement this part of the address space coherency protocol 
[5]. The PLATINUM shootdown mechanism is very different 
from that used in Mach. The differences arise largely be- 
cause the PLATINUM mechanism is designed specifically for 
NUMA multiprocessors. 

Because code and data are replicated in PLATINUM each 
processor needs to have its own private set of virtual-to- 
physical mappings for each address space. While Mach uses 
a single shared page table (Pmap) per address space, each 
processor in PLATINUM must have its own private Pmap 
per address space. Since a Pmap is only a cache of the 
valid virtual-to-physical translations, it need not contain 
mappings for everything in an address space, rather only 
a working set for that processor. Thus, in contrast with 
a scheme examined by Holliday [15], scalability is not re- 
stricted by replication of page tables. 

In addition to reducing latency and contention, using a lo- 
cal, private Pmap for each processor allows the construction 
of a fast shootdown mechanism. Black et al. discuss two
problems that result from multiple processors sharing a sin- 
gle Pmap in Mach. If the processor initiating the shootdown 
instructs a target processor to flush its ATC before updating 
the Pmap, the target processor may reload an inconsistent 
entry. If, on the other hand, the initiating processor up- 
dates the Pmap before instructing the target processor to 
flush its ATC, the target processor may write back its ATC 
entry to update the reference or modify bits, thereby creat- 
ing an inconsistent Pmap. Their solution to these problems 
is to stall the target processors while the initiator changes 
the Pmap. Since PLATINUM uses a Pmap per processor, it 
does not face either of these problems. 

A consequence of the replication of mapping information 
is that the Pmaps must be kept coherent as well as the 
ATCs. Part of the protocol is performed by the processor 
initiating the shootdown and part is performed by the pro- 
cessors sharing the address space with the initiator. They 
communicate through the Cmap message queues and syn- 
chronize through interprocessor interrupts. 

The initiating processor posts a short message describing 
the change of mapping to the Cmap message queue of each 
affected address space. A change to a specific address space 
affects only that address space, but a change of mappings 
required by the data coherency protocol must affect every 
address space in which the Cpage is mapped. Part of each 
message is the bit mask specifying the set of target proces- 
sors that eventually have to apply the change to their Pmap 
for this address space. This set is exactly the set of proces- 
sors appearing in the reference mask of each Cmap entry for 
this Cpage. The set of target processors is thus restricted 
to those that are actually using a mapping for this Cpage. 
Furthermore, a processor need only be interrupted to per- 
form the change if the address space is currently active. The 
remainder of the target processors will update their Pmaps 
when they activate the address space. In contrast, the Mach 
shootdown mechanism must interrupt each processor with 
the address space activated, even if that processor has never 
referenced the page. 

On the target processors the update is performed by a 
Cmap synchronization handler that is called as a result of 
an interprocessor interrupt or as part of the activation of 
an address space. Consequently, kernel code that runs at 
the interprocessor interrupt level or higher is not allowed to 
access coherent memory. The synchronization handler scans
the queue of change messages. If the processor appears in
the target mask of a message, it applies the change to its 
Pmap and removes itself from the target mask. When the 
target mask is clear, the message is removed from the queue. 
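The two halves of the shootdown can be sketched as below. This is a simplified, single-threaded model under stated assumptions: the dictionary layout, the names `post_change` and `synchronize`, and the `interrupt` callback are hypothetical, and the locking and atomic operations a real kernel would need are omitted.

```python
def post_change(cmap, vaddr, directive, interrupt):
    """Initiator side: queue one message and interrupt only the active targets."""
    targets = cmap["entries"][vaddr]["reference_mask"]
    if targets == 0:
        return                                  # no processor holds this mapping
    cmap["messages"].append({"vaddr": vaddr, "directive": directive, "targets": targets})
    for p in range(targets.bit_length()):
        is_target = (targets >> p) & 1
        is_active = (cmap["active_mask"] >> p) & 1
        if is_target and is_active:
            interrupt(p)                        # inactive targets catch up on activation


def synchronize(cmap, processor):
    """Target side: run from the interrupt or when activating the address space."""
    bit = 1 << processor
    pmap = cmap["pmaps"].setdefault(processor, {})
    remaining = []
    for msg in cmap["messages"]:
        if msg["targets"] & bit:
            if msg["directive"] == "invalidate":
                pmap.pop(msg["vaddr"], None)
            elif msg["vaddr"] in pmap:          # "restrict": drop to read-only
                page, _rights = pmap[msg["vaddr"]]
                pmap[msg["vaddr"]] = (page, "r")
            msg["targets"] &= ~bit              # this processor is done
        if msg["targets"]:
            remaining.append(msg)               # keep until every target has applied it
    cmap["messages"] = remaining


cmap = {
    "entries": {0x10: {"reference_mask": 0b0110}},   # processors 1 and 2 hold mappings
    "messages": [],
    "active_mask": 0b0010,                           # only processor 1 is running here
    "pmaps": {1: {0x10: ("pageA", "rw")}, 2: {0x10: ("pageB", "rw")}},
}
interrupted = []
post_change(cmap, 0x10, "restrict", interrupted.append)
synchronize(cmap, 1)      # triggered by the interprocessor interrupt
synchronize(cmap, 2)      # triggered later, when processor 2 activates the space
```

In this run only processor 1 is interrupted; processor 2 applies the same message lazily, after which the queue is empty.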

The memory management system obtains a significant re- 
duction of overhead by deactivating the kernel address space 
when a processor begins running in user mode. This reduces 
the number of interprocessor interrupts each processor re- 
ceives. When a processor reenters the kernel to service a 
trap or interrupt, it has to reactivate the kernel address 
space before it can access coherent memory. 


3.2. The Data Coherency Protocol 


The data coherency protocol is derived from a directory- 
based cache coherency protocol that uses selective invalida- 
tion of cache blocks [7]. When a processor tries to access a 
Cpage that has no local physical page backing it, the coher- 
ent memory mechanism can always choose either to make a 
local copy of a page or to create a mapping to an existing 
remote page. The ability to use remote mappings is espe- 
cially important when multiple processors make frequent, 
interleaved, and fine-grain modifications to a shared data 
structure. The resulting interprocessor interference causes 
the frequent execution of any protocol to maintain coher- 
ence among multiple copies. By using remote mappings 
the mechanism can, in effect, selectively and dynamically 
disable replication and migration when interference is de-
tected.

A coherent page can be in one of four states:


empty means that there are no physical pages backing the 
Cpage. Thus, there are no virtual-to-physical map- 
pings to this page. 


present1 means that there is exactly one physical page
backing the Cpage and all virtual-to-physical mappings 
are restricted to read access. A virtual-to-coherent 
mapping may permit write access to the Cpage, but 
the virtual-to-physical mapping is restricted in order 
to implement the coherency protocol. 


present+ means that there are two or more physical pages 
in different memory modules backing the Cpage. All 
virtual-to-physical mappings for the Cpage are re- 
stricted to read access. As above, a virtual-to-coherent 
mapping may permit write access to the Cpage. 


modified means that there is one physical page backing 
the Cpage and at least one virtual-to-physical mapping 
allows write access. 


Figure 4 is a transition diagram for the protocol. The
present1 state is distinguished from the present+ state
for performance reasons. The transition from present+ to
modified on a write miss requires the invalidation of at
least one virtual-to-physical mapping and the reclamation
of at least one physical page. The transition from present1
to modified requires neither.
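The four states follow directly from two facts about a Cpage: how many physical pages back it and whether any mapping allows writes. The sketch below encodes that reading; the function names are hypothetical, and the count of reclaimed pages assumes that a write leaves exactly one copy.

```python
def cpage_state(physical_copies, has_writable_mapping):
    """Classify a Cpage from its directory size and write-mapping flag."""
    if physical_copies == 0:
        return "empty"
    if has_writable_mapping:
        if physical_copies != 1:
            raise ValueError("a modified Cpage has exactly one physical copy")
        return "modified"
    return "present1" if physical_copies == 1 else "present+"


def pages_reclaimed_on_write(physical_copies):
    """Copies that must be invalidated and freed so a single writable copy remains."""
    return max(physical_copies - 1, 0)
```

Under this reading a write to a present1 page reclaims nothing, which is the performance reason for keeping that state separate from present+.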

Transitions between states are triggered by page faults 
or the defrost daemon. When a page fault occurs during an 
attempted access to a non-empty Cpage, the Cpage system 
can either map an existing physical copy for remote access, 
or create and then map a local physical copy. For example, 
if there is a write miss on a Cpage in the modified state, 
the choice is between mapping the existing physical copy or
allocating a local physical page, copying the data, and then
invalidating the original copy. Similar decisions arise for the
other cases.

Figure 4: State Transition Diagram

A policy module within the Cpage system chooses the 
appropriate action on each page fault. The current policy 
uses the history of recent invalidations for the Cpage. Re- 
cent invalidation indicates that the Cpage is being actively 
write-shared. The Cpage system uses this information to 
limit the overhead of running the protocol by forcing re- 
mote mappings for recently invalidated pages and allowing 
replication for the others. 
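A policy of this kind can be reduced to a single comparison, sketched below. The text does not give the length of the "recent" interval, so the `window` parameter, its units, and the function name are assumptions made purely for illustration.

```python
def choose_action(now, last_invalidation, window):
    """Return "remote" for a recently invalidated Cpage, otherwise "replicate".

    `window` is a hypothetical tuning parameter in the same time units as `now`.
    """
    if last_invalidation is not None and now - last_invalidation < window:
        return "remote"       # actively write-shared: avoid running the protocol again
    return "replicate"        # quiet page: a local copy is likely to pay off
```

For example, with a window of 10 time units, a page invalidated 5 units ago is mapped remotely, while one invalidated 50 units ago, or never, is replicated.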


3.3. Replication and Data Coherency 


Both the replication mechanism and the data coherency pro- 
tocol are implemented by the page fault handler. When 
a page fault occurs, the Cpage fault handler searches the 
Cmap for an entry that maps the faulting virtual address. 
If an entry is found, the page fault is a coherent memory 
fault. Otherwise, the fault is passed to the virtual memory 
fault handler. 

The Cmap entry contains a pointer to an entry in the 
Cpage table. The fault handler tests the bit mask in the 
Cpage to discover whether a local physical page backs it 
(see Figure 3). Since Cpages may be shared by multiple 
address spaces, a local physical copy may already exist. If 
a local copy exists, the handler applies a hash function to 
the index of the Cpage and scans the inverted page table 
to find the physical page. The inverted page table is used 
rather than the list of physical pages in the directory for the 
Cpage because the former is guaranteed to use strictly local 
memory accesses, thus decreasing both latency and poten- 
tial contention. Even when contention is not a problem it 
is cheaper to scan over a few collisions in the inverted page 
table than to search the list of physical pages with remote
memory accesses.
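The local lookup might look like the following sketch. The text says only that a hash of the Cpage index is applied and that collisions are scanned; the modulo hash, the linear probing, and the list representation (slot i holds the Cpage index that owns physical page i, or `None` if the page is free) are assumptions.

```python
def find_local_page(inverted_table, cpage_index):
    """Probe from the hashed slot; return the physical page number or None."""
    n = len(inverted_table)
    start = cpage_index % n                  # assumed hash function
    for step in range(n):
        slot = (start + step) % n            # assumed linear probing over collisions
        if inverted_table[slot] == cpage_index:
            return slot
    return None


# One module with 8 physical pages; Cpages 3 and 11 hash to the same slot.
table = [None] * 8
table[3] = 3
table[4] = 11
```

Every access in this search touches only the inverted page table of the local module, which is the property the text relies on.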

If there is no local physical copy and the fault is a read 
miss, the fault handler consults the replication policy mod- 
ule to determine whether or not it should replicate the 
Cpage. If the Cpage is to be replicated, the handler uses
the inverted page table to find a free physical page, then allo-
cates the physical page by entering the address of the Cpage
in the inverted page table entry for the physical page. If the
existing state of the Cpage is modified, the handler uses 
the shootdown mechanism to restrict all virtual-to-physical 
translations for the Cpage to read-only access. The handler 
then performs a block transfer from another physical copy, 
and adds the physical page to the directory. 

Similar sequences of actions occur on a write miss. For 
example, if the state of the Cpage is present+, the handler 
first uses the shootdown mechanism to invalidate all virtual- 
to-physical translations for the remote physical copies, and 
then frees all of these pages. The handler concludes by 
mapping the chosen physical copy of the Cpage with the 
necessary access rights. 

If the policy indicates that the Cpage should be frozen 
rather than replicated, there can only be one physical page 
backing the Cpage. Furthermore, the Cpage must be in a 
modified state. The handler creates a mapping for the re- 
mote physical page granting the full access rights permitted 
by the virtual memory system. 
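The fault-handling cases above can be summarized in a small sketch. It is a simplified model, not the kernel's code: the `Cpage` record, the action strings, the state name "present", and the state assigned after replication are assumptions made for illustration, and only the cases described in the text are covered.

```python
from dataclasses import dataclass, field


@dataclass
class Cpage:
    state: str                                  # "present", "present+", or "modified"
    copies: set = field(default_factory=set)    # nodes holding a physical copy
    frozen: bool = False


def read_miss(cpage, node, should_replicate):
    """Return the actions a handler might take for a read miss on `node`."""
    if node in cpage.copies:
        return ["map existing local copy"]
    if not should_replicate:
        # Frozen: a single physical page backs the Cpage and is mapped remotely.
        cpage.frozen = True
        return ["map remote physical page"]
    actions = ["allocate local physical page"]
    if cpage.state == "modified":
        actions.append("shootdown: restrict mappings to read-only")
    actions.append("block transfer from an existing copy")
    cpage.copies.add(node)
    cpage.state = "present+"                    # assumed state after replication
    actions.append("map local copy")
    return actions


def write_miss(cpage, node):
    """Write miss by a node that already holds a copy of a present+ Cpage."""
    if cpage.state != "present+" or node not in cpage.copies:
        raise ValueError("case not covered by this sketch")
    remote = cpage.copies - {node}
    actions = []
    if remote:
        actions.append("shootdown: invalidate mappings of remote copies")
        actions.append(f"free {len(remote)} remote page(s)")
    cpage.copies = {node}
    cpage.state = "modified"
    actions.append("map chosen copy with write access")
    return actions
```

The `should_replicate` argument stands in for the answer of the replication policy module, which is discussed in Section 4.2.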


4 Performance and Choice of Replica- 
tion Policy. 


The copying of data in a PLATINUM page migration oper- 
ation is a kernel-initiated, page-aligned block transfer of a 
known size. In the absence of contention this takes 1.11 ms 
for the default page size of 4K bytes. 

The total time for a read miss that replicates a non- 
modified page ranges from 1.34 ms to 1.38 ms. The shorter 
time occurs when the relevant kernel data structures are lo- 
cal, while the longer time results from remote data access. 
Of this time, copying the page accounts for 1.11 ms, and the 
fixed overhead of allocating and mapping a physical page 
accounts for the remaining 0.23 ms to 0.27 ms. 

A read miss that replicates a modified page takes from 
1.38 ms to 1.59 ms if only one processor has to be inter- 
rupted to restrict its mapping to read-only access. The fixed 
overhead in this case ranges from 0.27 ms to 0.48 ms. The 
additional cost compared to a read miss on a non-modified
page is due to the address-space coherency protocol.

A write miss on a present+ page takes from 0.25 ms
to 0.45 ms when only one processor has to be interrupted to
invalidate its mapping and one physical page is freed. For
up to 16 processors, the incremental delay to the initiating
processor of interrupting each additional processor to
invalidate a mapping and freeing a physical page is no more than
17 µs. Freeing a physical page uses one remote memory read
and one write, accounting for about 10 µs of this time. We
therefore believe the incremental cost of interrupting a
processor to restrict a mapping to be about 7 µs. In
contrast, Black et al. report an incremental cost of 55 µs on a
16-processor NS32332 Encore Multimax [5].
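As a rough illustration of how these figures combine, the following sketch computes an upper bound on the write-miss delay seen by the initiating processor. The linear model and the function name are assumptions; the constants are the measurements quoted above.

```python
def write_miss_upper_bound_us(num_interrupted,
                              base_high_us=450.0,
                              per_extra_us=17.0):
    """Upper bound, in microseconds, on a write miss to a present+ page.

    base_high_us is the slower measured time with one interrupted processor;
    per_extra_us is the reported bound on each additional interruption.
    """
    if not 1 <= num_interrupted <= 16:
        raise ValueError("the measurements cover 1 to 16 processors")
    return base_high_us + (num_interrupted - 1) * per_extra_us
```

For example, `write_miss_upper_bound_us(4)` evaluates to 501.0, so under this model three extra shootdowns add about a tenth of the single-processor cost.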

These timings were gathered on a 16-processor BBN But- 
terfly Plus Multiprocessor. A processing node on this ma- 
chine consists of a 16.67MHz MC68020 with a MC68851 
MMU and 4 MBytes of physical memory.


4.1 When does it pay to migrate a page?


To decide when it is appropriate to migrate a data page 
rather than make a remote mapping, it is necessary to es- 
timate the relative costs of each of these options. The fol- 
lowing analysis is based on the contention-free latency of 
remote memory access. Contention, both at the memories 
and in the switch, increases latency by serializing requests. 
In the presence of contention the benefits of migration or 
replication can be much higher than indicated here. 
Suppose a data structure, X, is shared and written by 
p processors; further suppose that X is the sole occupant 
of a coherent page. Each processor operates on X in a 
critical section as follows: obtain the lock for X, perform 
a computation f entailing r memory references on it, and 
release the lock. If this operation were encapsulated in a 
procedure call it might be performed in one of three ways: 


• The operation is executed by the processor requesting
it and the data is not moved. The operation is an
ordinary procedure call using any combination of local
and remote memory references to access the data.


• The data and the process executing the operation are
co-located by moving the data. The operation is an
ordinary procedure call using local memory references.


• The data and the process executing the operation are
co-located by performing a remote procedure call. Access
to the data uses local memory references.


While implementations of languages such as Emerald [17] 
on top of PLATINUM would utilize the third option, we re- 
strict ourselves to consider the choice between the first two. 
Let $C_{remote}$ be the cost of using remote memory references,
$C_{local}$ be the cost of using local references, and $C_{migrate}$ be
the cost of moving the data. It is cheaper to move the data
when

$$C_{remote} > g(p)\,C_{migrate} + C_{local} \qquad (1)$$

where $g(p)$ is the average number of data movements necessary
to save a remote operation. It is the ratio of the total
number of executions of f to the number of executions of f
that would use remote memory access if the data were not
moved. When p processors access X in strict round-robin
order, $g(p) = p/(p - 1)$. For example, consider two processors
that alternate in touching X. If X is not moved, there
will be one remote and one local execution of f per cycle.
If it is moved, there will be two local executions of f and
two data movements per cycle. Thus $g(2) = 2$. This is the
worst-case scenario. For large p, $g(p)$ approaches 1. If the
operations are not interleaved among the processors then
$g(p)$ can be less than 1.


Let


$s$ be the size of a page expressed in terms of the typical unit
of access. On the Butterfly Plus this is a 32-bit word.


$T_l$ be the time to perform a typical local memory reference
on a 32-bit word. On a Butterfly Plus this is about 320
ns.


$T_r$ be the time to perform the corresponding remote memory
reference. On the Butterfly Plus this is about
5000 ns to read a 32-bit quantity. Write operations
are faster.


$T_b$ be the time to copy a word in a page migration operation.
This is about 1100 ns on the Butterfly Plus.


        $S_{min}$, minimum page size in words
ρ       g(p) = 0.5    g(p) = 1    g(p) = 2
0.35    232           973         never
0.48    149           435         never
0.60    111           298         1784
0.75    85            210         793
1.0     61            141         412
1.5     39            84          210
2.0     28            61          141


Table 1: Inequality 2 evaluated at some interesting points.
It always pays to migrate data when the page size is greater
than $S_{min}$.


Define $\rho = r/s$ to be the density of references to X. For
example, if the size of X is $s$ and f reads and writes every
location in X, $\rho = 2$. On the other hand, if X occupies only
half the page and f writes one half of X's data, $\rho = 0.25$.

We therefore have $C_{local} = \rho s T_l$ and $C_{remote} = \rho s T_r$.
The cost of migration is divided into the cost of block transfer,
$s T_b$, plus a fixed overhead, about 0.48 ms in the current
implementation. Substituting these into inequality 1 and
rearranging the terms, we conclude that it always pays to
migrate when

$$s > \frac{107\,g(p)}{\rho - 0.24\,g(p)}. \qquad (2)$$
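The rearrangement can be written out step by step. In this sketch $\rho$ is the reference density $r/s$, and $T_o$ denotes the fixed overhead of a migration; $T_o$ is a symbol introduced here for illustration and is not the paper's notation.

```latex
\begin{align*}
\rho s T_r &> g(p)\,\bigl(s T_b + T_o\bigr) + \rho s T_l \\
s\,\bigl[\rho\,(T_r - T_l) - g(p)\,T_b\bigr] &> g(p)\,T_o \\
s &> \frac{\dfrac{T_o}{T_r - T_l}\,g(p)}{\rho - \dfrac{T_b}{T_r - T_l}\,g(p)}
  \qquad \text{provided } \rho > \frac{T_b}{T_r - T_l}\,g(p).
\end{align*}
```

With $T_b \approx 1100$ ns and $T_r - T_l \approx 4680$ ns, the coefficient in the denominator is about 0.235, which rounds to 0.24. A fixed overhead of roughly 0.48 ms to 0.5 ms gives a numerator constant of roughly 103 to 107 words. When the proviso fails, the denominator is not positive and no page size makes migration pay off.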


Note that the constant in the numerator is proportional to
the fixed overhead of the migration operation and that the
coefficient of $g(p)$ in the denominator is the ratio
$T_b/(T_r - T_l)$. From this we make the following observations:


• To determine when migration is economical, the ratio
$T_b/(T_r - T_l)$ of block transfer time to the time that can
be saved by using local rather than remote memory
access operations is the single most important
characteristic of the architecture. It puts a lower bound on
the minimum reference density $\rho$ for which migration
makes sense for any block size. This in turn bounds
the minimum usable page size. The existence of a fast
block transfer mechanism is vital to the performance of
any program that uses data migration and replication
on a NUMA machine!


• For each $g(p)$ and $\rho$ pair, the need to amortize the fixed
overhead of the coherence protocol puts a lower bound
on the page size that can be used economically. For
a fixed $g(p)$ and $\rho$, a decrease in overhead results in
a proportional decrease in the minimum page size for
which migration makes sense.


• With round-robin access, as the number of processors
sharing X increases, $g(p)$ decreases towards 1, thus
making migration more attractive.
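The round-robin behaviour of g(p) can be checked with a short counting sketch. It assumes that, without migration, X stays on processor 0 and every execution of f by another processor is remote; the function name and the fixed home processor are illustrative choices.

```python
from fractions import Fraction


def g_round_robin(p, rounds=100):
    """Ratio of all executions of f to those that would be remote if X never moved."""
    if p < 2:
        raise ValueError("sharing needs at least two processors")
    total = 0
    remote = 0
    for _ in range(rounds):
        for proc in range(p):        # strict round-robin order
            total += 1               # with migration, each execution moves X once
            if proc != 0:            # without migration, X stays on processor 0
                remote += 1
    return Fraction(total, remote)
```

Under these assumptions the function returns 2 for two processors, 4/3 for four, and 16/15 for sixteen, matching p/(p - 1).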


These factors determine the granularity of data access
that must be seen in the application to ensure that migration
is always the correct action for a given page size. Some
values for inequality 2 are presented in Table 1.
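Inequality 2 is easy to evaluate directly. The sketch below uses the rounded constants 107 and 0.24 from the inequality, so its results can differ from the table entries by a few words; the function name and the use of `None` for "never" are assumptions.

```python
def min_page_size_words(rho, g):
    """Smallest page size (in 32-bit words) above which migration always pays.

    rho is the reference density r/s and g is the value of g(p).
    Returns None when no page size satisfies inequality 2.
    """
    denominator = rho - 0.24 * g
    if denominator <= 0:
        return None
    return 107 * g / denominator


if __name__ == "__main__":
    for rho in (0.48, 0.60, 0.75, 1.0, 1.5, 2.0):
        row = [min_page_size_words(rho, g) for g in (0.5, 1, 2)]
        print(rho, ["never" if v is None else round(v) for v in row])
```

Because the default page of 4K bytes holds 1024 words, any combination whose minimum is below 1024 favours migration under this contention-free model.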

This analysis emphasizes the importance of coarse data- 
access granularity for attaining good performance on a 
NUMA machine, especially if data migration on a per-page 
basis is used. A large page size allows us to better amortize
the fixed overhead of data transfer and thus to tolerate
a slightly smaller $\rho$. For a fixed granularity of data access
smaller than the size of a page, however, $\rho$ is inversely
proportional to page size, thus negating any potential
advantage of increasing page size. On the other hand, if a program
has a granularity of data access that is greater than the size
of a page, $\rho$ remains more or less constant as pages grow.
For a fixed problem size one would expect the granularity 
of sharing to decrease as the number of processors in a mul- 
tiprocessor increases. On the other hand, we believe, as 
do others [28, 14], that a major role of parallel machines is 
to solve ever-larger problems rather than to solve fixed-size 
problems in ever-shorter times. These larger problems will 
allow the continued use of coarse granularity as systems are 
made larger. 


4.2 Replication Policy 


While the replication and migration of data has significant 
benefit when data access granularity is relatively coarse, the 
overhead of trying to maintain coherency in the presence of 
fine-grained write-sharing could be prohibitively expensive. 
In such circumstances it is less expensive to access remote 
memory than to try to migrate or replicate the data. Since 
the choice between data movement and remote access de- 
pends upon the relative costs of the alternatives, we have 
delayed discussion of the replication policy until the details 
of the mechanism and of its cost have been presented. PLAT- 
INUM is designed to support experimentation with a family 
of policies. We focus on the interim policy currently used. 

Since invalidations occur as a result of interprocessor
interference, all policies use a recorded history of recent
invalidations to estimate the interference for each coherent
page. The current version uses a minimal history consisting
of a timestamp for the most recent invalidation by the
coherency protocol of a mapping for that coherent page. On
a page fault handled by coherent memory, a coherent page
is replicated or migrated if the last invalidation by the
protocol was at least $t_1$ in the past. Otherwise it is frozen
rather than replicated. Since invalidations cause the Cpage
to go into the modified state and since it could not have
been replicated since then, there can only be one physical
page backing a frozen Cpage. While it remains frozen, all
new mappings to a Cpage are to that single physical page.
We have used two policies for dealing with faults occurring
after the $t_1$ period expires on a frozen Cpage. The default
policy is to continue to create remote mappings for the
Cpage until the page is explicitly thawed. The alternative 
is to allow the frozen coherent page to be replicated and 
thus thawed as a consequence of an attempted access. The 
programs we have examined thus far exhibit no significant 
difference in performance between these policies. 
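The interim policy can be sketched as a single decision function. The function name, argument names, and returned strings are hypothetical; the threshold test and the two ways of treating an already frozen page follow the description above.

```python
def fault_decision(now_ms, last_invalidation_ms, frozen=False,
                   thaw_on_access=False, t1_ms=10.0):
    """Decide how a coherent-memory fault is served.

    Returns "replicate_or_migrate" or "remote_mapping" (the page is frozen).
    last_invalidation_ms is None if no invalidation has been recorded, which
    this sketch treats as no interference (an assumption).
    """
    if frozen and not thaw_on_access:
        # Default policy: keep mapping remotely until explicitly thawed.
        return "remote_mapping"
    recently_invalidated = (last_invalidation_ms is not None
                            and now_ms - last_invalidation_ms < t1_ms)
    if recently_invalidated:
        return "remote_mapping"
    return "replicate_or_migrate"
```

With `thaw_on_access=True` the function models the alternative policy, in which a fault arriving after the threshold has passed replicates the frozen page and thereby thaws it.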

Based on the speed of the Butterfly processor and the
need to amortize the replication of a coherent page over
a reasonable number of accesses, $t_1$ is currently set to 10
ms. A few tests indicated that application performance is
insensitive to varying $t_1$ from 10 ms up to about 100 ms.
Once the collection of application programs has grown to a
reasonable size we will perform systematic experiments on
the effects of varying this and other parameters.

After all of the threads that share a frozen Cpage have 
mappings to it, further access to that Cpage causes neither 
additional faults nor the associated overhead. Since the co- 
herency protocol as described thus far is driven strictly by 
page faults, the Cpage could remain frozen permanently. 
While it may be appropriate to freeze a Cpage at a partic- 
ular point in the execution of a program, a change in the 
access pattern of that page may make it desirable to thaw it 
in the future. PLATINUM therefore has a simple mechanism
for thawing pages, thus allowing the memory management
system to react to phase changes as well as to thaw any
incorrectly frozen pages.

The Cpage module maintains a list of frozen Cpages and
a clock interrupt every $t_2$ seconds activates the defrost
daemon to invalidate all mappings to the frozen pages.
Subsequent access attempts will cause faults that may replicate
or migrate a recently thawed coherent page. To keep the
overhead low, $t_2$ is currently set to 1 second. Reducing $t_2$
may allow coherent pages frozen accidentally to be
replicated sooner, but it just adds overhead for coherent pages
that should remain frozen.
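A minimal model of the defrost daemon follows. The class and method names are hypothetical, and the shootdown that the real kernel would perform is reduced to clearing a set of processor identifiers.

```python
class CpageModule:
    """Toy model of the frozen-page list and one daemon activation."""

    def __init__(self):
        self.frozen = []      # ids of frozen Cpages
        self.mappings = {}    # Cpage id -> set of processors with a mapping

    def freeze(self, cpage_id, processors):
        self.frozen.append(cpage_id)
        self.mappings[cpage_id] = set(processors)

    def defrost(self):
        """Invalidate every mapping to every frozen page and empty the list."""
        thawed, self.frozen = self.frozen, []
        for cpage_id in thawed:
            # The next access faults, and the replication policy runs again.
            self.mappings[cpage_id].clear()
        return thawed


def activations(elapsed_s, t2_s=1.0):
    """Number of daemon activations in elapsed_s seconds of clock interrupts."""
    return int(elapsed_s // t2_s)
```

Halving `t2_s` doubles the number of activations, which is the overhead the text weighs against thawing accidentally frozen pages sooner.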

An alternative is to maintain the list of frozen pages as
a priority queue ordered by thaw time. This allows the
daemon to run more often than every $t_2$ seconds. It also
allows $t_2$ to be set adaptively on a per-page basis. Although
there is evidence that thawing frozen pages is important for 
performance, we do not yet have reason to believe that a 
more sophisticated policy for thawing will have much effect. 
Since a more sophisticated policy would add overhead to 
the system, we plan to continue to use the simple policy 
described above until the problem is better understood. 
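The priority-queue alternative might look like the following sketch, which uses a binary heap keyed by thaw time. It illustrates the idea only; the paper does not describe an implementation, and the class and method names are assumptions.

```python
import heapq


class ThawQueue:
    """Frozen pages ordered by the time at which each should be thawed."""

    def __init__(self):
        self._heap = []   # entries are (thaw_time_s, cpage_id)

    def freeze(self, cpage_id, now_s, t2_s=1.0):
        # A per-page t2 allows the thaw delay to be chosen adaptively.
        heapq.heappush(self._heap, (now_s + t2_s, cpage_id))

    def due(self, now_s):
        """Pop and return the pages whose thaw time has arrived."""
        thawed = []
        while self._heap and self._heap[0][0] <= now_s:
            thawed.append(heapq.heappop(self._heap)[1])
        return thawed
```

Each daemon activation then inspects only the head of the heap rather than the whole list, at the price of a logarithmic cost per freeze.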

A possible reason for the access pattern of a page to 
change is that two or more variables with different access 
patterns are in that page. For example, co-locating a syn- 
chronization variable such as a lock or event count with a 
read-only variable on one page can lead to problems because 
they demand very different treatments from the memory 
management system. Active use of synchronization vari- 
ables will cause their pages to be frozen while a read-only 
variable should be replicated. The preferred solution to this 
problem is for the programmer, the compiler, and the lan- 
guage run-time support to be intelligent about the alloca- 
tion of variables to virtual pages. Even if this allocation is 
done poorly, thawing can salvage reasonable performance if 
each variable is used primarily in a different phase of the 
program. 

Experiences with our first version of the Gaussian elimi- 
nation program, described in the next section, provide anec- 
dotal evidence of the importance of intelligent memory al- 
location, thawing, and performance instrumentation. The 
program takes the problem matrix size as a parameter and 
writes this value to a variable during the startup phase. The 
matrix size is used in the termination test of the inner loop 
of the algorithm so it is vital that each processor have a 
local copy. The slave threads did not make private copies of 
this variable, but the page was replicated. Later we added 
a spin-lock variable to facilitate measurement of execution 
times. It is used as a barrier at the start of the elimina- 
tion phase of the program and is not touched thereafter. 
Spinning on the lock froze the Cpage. Consequently, all 
but one thread generated a remote access in its inner loop. 
This increased the latency of accessing the shared variable,
which dramatically increased the execution time and became
a bottleneck with five or more processors.

In addition to timing data, the kernel produces a detailed
report on the behavior of memory management. For each 
Cpage this includes the number of coherent memory faults, 
a measure of contention in the Cpage fault handler for that 
page, and whether the Cpage was frozen by the replication 
policy. Given this instrumentation it was a simple matter to 
diagnose the problem and program around it by giving each 
thread a private matrix-size variable. Thawing was soon 
added to the kernel and the old version of the program took 
less than two seconds more to run than the new version.
Running the defrost daemon adds no measurable
overhead to the new version of the program.


5 Application Performance 


We report preliminary performance measurements for three 
application programs running on PLATINUM. Each of these 
programs has a memory access pattern distinct from the 
others. 


5.1 Gaussian Elimination 


The first application we examined was the simulation of 
Gaussian elimination described in the introduction. This 
particular computation was chosen because it had been 
studied previously on an earlier version of the Butterfly 
for a variety of programming systems and styles [18, 19]. 
LeBlanc compared the performance of an implementation 
on the Uniform System from BBN [3] with Gaussian elimi- 
nation implemented on SMP [20], a message passing library 
developed at the University of Rochester. We used the same 
800x800 matrix as LeBlanc. 

The PLATINUM implementation is similar to the coarse- 
grain implementation on the Uniform System found to be 
the most efficient in LeBlanc’s study. There is a single 
thread per processor and each thread is statically allocated 
a number of rows of the matrix. In each round some thread 
selects a pivot row which is then read by all of the other 
threads. Each thread then performs the elimination opera- 
tion on its set of rows. 
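The structure of the computation can be sketched sequentially. This is an assumption-laden illustration rather than the measured program: rows are assigned to threads cyclically, row k is always taken as the pivot (so pivots are assumed nonzero), and the per-thread work is simulated by an ordinary loop.

```python
def eliminate(matrix, num_threads):
    """Forward elimination with rows statically assigned to threads."""
    n = len(matrix)
    owner = [row % num_threads for row in range(n)]   # assumed cyclic assignment
    for k in range(n):
        pivot = matrix[k]             # chosen by owner[k], then read by every thread
        for thread in range(num_threads):   # each iteration stands in for one thread
            for i in range(k + 1, n):
                if owner[i] != thread:
                    continue          # a thread updates only the rows it owns
                factor = matrix[i][k] / pivot[k]
                for j in range(k, len(matrix[i])):
                    matrix[i][j] -= factor * pivot[j]
    return matrix
```

In each round the pivot row is read by all threads and written by none, while every other row is written only by its owner. This is the coarse-grain sharing pattern that lets the pivot row be replicated.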

The differences between the two versions of the Butter- 
fly reduce the accuracy of quantitative comparisons of per- 
formance measures. Nevertheless, such measures provide a 
framework for qualitative comparison. The program run- 
ning on PLATINUM yields a 16-processor speedup of 13.5 
versus 10.6 for the Uniform System program [18]. In con- 
trast, the SMP message-passing implementation yielded a 
speedup of 15.3. 

An examination of the post-mortem statistics gathered 
by the kernel shows that the PLATINUM implementation ex- 
hibits high contention in the Cpage fault handler for Cpages 
that contain the pivot rows. This is attributable to a seri- 
alization in hardware of the replication of the data backing 
a Cpage. As expected, only the Cpage containing an array 
of event counts used for synchronization was frozen. 


5.2 Merge Sort


This is a parallel merge sort using a simple tree of merge
operations, each of which is performed by a single thread.
We chose this program because it had been studied on a
Sequent Symmetry Multiprocessor [1]. The Sequent Symmetry
is a UMA multiprocessor. The one used in the study
had model A processors with 8Kbyte write-through caches.
Figure 5 shows the measured speedup curves for this program.
The program shows better speedup running on the
Butterfly Plus under PLATINUM than on the Sequent Symmetry
for the same size problem on the same number of processors.
We believe this is due to the small cache size and
write-through policy on the Sequent. During each merge
phase one half of the data to be merged will already be in
the merging processor's local memory. Furthermore, with
the linear access pattern of merging, the processor will touch
all of the data prefetched by each coherent page fault. The
problem is large enough, however, that none of the data will
remain in the Sequent cache between merge phases.

Figure 5: Merge Sort Speedup
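The tree of merges can be illustrated with a sequential sketch in which each call to `merge` stands for the work of one thread. The way the input is split into runs and the function names are assumptions.

```python
def merge(left, right):
    """Linear two-way merge of two sorted lists."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def tree_merge_sort(data, num_leaves):
    """Sort by splitting into runs and merging them pairwise, level by level."""
    if num_leaves < 1:
        raise ValueError("need at least one leaf")
    size = max(1, -(-len(data) // num_leaves))       # ceiling division
    runs = [sorted(data[k:k + size]) for k in range(0, len(data), size)] or [[]]
    while len(runs) > 1:
        next_level = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                next_level.append(merge(runs[k], runs[k + 1]))
            else:
                next_level.append(runs[k])           # odd run carried upward
        runs = next_level
    return runs[0]
```

Both inputs of a merge are scanned front to back, which is the linear access pattern that makes each page fetched by a coherent page fault fully used.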


5.3 Neural Network Simulator 


A very different application is a simulator used by neural 
network researchers at the University of Rochester studying 
recurrent backpropagation networks [27]. Unlike the oth- 
ers, this program was developed by someone with no previ- 
ous experience programming the Butterfly Plus. While the 
other programs were written to exploit coarse-grain paral- 
lelism on large amounts of data, the simulator operates on 
much less data and at a very fine granularity. 

We measured (Figure 6) the performance of a simulation
of a three layer network learning a classic encoder problem
[24]. There were 40 units and 16 pairs of inputs and outputs.
The simulator is parallelized by simple for-loop parallelization
on units. Each processor continually simulates a set
of units, depending only on the atomicity of memory operations
for synchronization when it accesses data shared with
other threads. The non-determinism produced by the lack
of synchronization introduces negligible variability of execution
time. Given the very fine-grain nature of the algorithm,
PLATINUM cannot use replication or migration to good
advantage. The coherent memory system quickly gives up and
the data pages of the application are frozen in place. The
speedup curve is linear over the range measured, but the
extensive use of remote accesses limits the contribution of
each incremental processor to about 1/2 that of a processor
that makes only local memory references.

Figure 6: Recurrent Backpropagation Simulator (Speedup
vs. Processors)


6 Experiences Programming on a Co- 
herent Memory 


In our experience, it is much easier to write applications to 
run on coherent memory than to run on non-uniform physi- 
cal memory. PLATINUM programs are smaller than both 
Uniform System programs and programs using message- 
passing styles because one need not write code either for 
explicit communication, or for explicit management of data 
location. For example, the code for the elimination phase 
of the PLATINUM, Uniform System, and SMP implementa- 
tions of Gaussian elimination are 17, 41, and 64 lines long, 
respectively. 

Despite the apparent familiarity of PLATINUM’s abstract 
machine model, a programmer still needs to understand and 
apply certain fundamental facts about parallel programming 
on a NUMA machine. It is of overwhelming importance to 
avoid programming styles entailing fine-grain write-sharing. 
Whether memory is being managed automatically by the 
coherent memory system or explicitly by the programmer, 
this fine-grain write-sharing introduces both latency that re- 
duces the effective processor speed and memory contention 
that serializes logically parallel computations. It is vital 
that most of the sharing of writable data be done at coarse
enough spatial and temporal granularities that a fast block
transfer mechanism can be used effectively.

In order for the coherent memory system to effectively 
manage data location, the programmer or compiler must un- 
derstand the sharing properties of data. Data with different 
access patterns should not be co-located on a single page. 
The private data of each thread should be separated from 
private data of other threads and from shared data. Read- 
only data should be kept separate from modifiable data. 
Coarse-grain modifiable data should be separated from fine- 
grain modifiable data such as locks. A run-time library for 
defining disjoint memory allocation zones and for specify- 
ing page-aligned allocation helps PLATINUM programmers 
to do this with a minimum of effort, even without com- 
piler support. Because a typical NUMA multiprocessor has 
a very large physical memory, the internal fragmentation 
introduced by this strategy has little impact and is vastly 
preferable to interprocessor interference. 
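The separation described above can be illustrated with a toy zone allocator. It is not the PLATINUM run-time library, whose interface the paper does not give; the `Zone` class, its methods, and the address arithmetic are hypothetical.

```python
PAGE_SIZE = 4096  # default page size reported in the paper (4K bytes)


def round_up_to_page(addr):
    return -(-addr // PAGE_SIZE) * PAGE_SIZE


def page_of(addr):
    return addr // PAGE_SIZE


class Zone:
    """A run of whole pages reserved for one class of data."""

    def __init__(self, base, num_pages):
        if base % PAGE_SIZE != 0:
            raise ValueError("zone base must be page aligned")
        self.next = base
        self.limit = base + num_pages * PAGE_SIZE

    def alloc(self, nbytes, page_aligned=False):
        """Return the start address of a new object of nbytes bytes."""
        start = self.next
        end = start + nbytes
        if page_aligned:
            # The object starts on a page boundary and keeps its pages to itself.
            start = round_up_to_page(start)
            end = round_up_to_page(start + nbytes)
        if end > self.limit:
            raise MemoryError("zone exhausted")
        self.next = end
        return start
```

Placing locks in one zone and read-only data in another guarantees that the two never share a page, at the cost of some internal fragmentation.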


7 Architectural Considerations 


The benefits of replication cannot be measured solely in 
terms of the ratio of local to remote memory access times. 
As the degree of parallelism increases on a machine with 
a large number of processors, contention for memory mod- 
ules and for the interconnection network become the domi- 
nant factors determining performance. The most important 
impact of coherent memory is that it effectively uses local 
memories as caches to reduce contention. 

An effective block transfer mechanism is critical to an
efficient implementation of coherent memory. It should be
both fast and asynchronous with respect to program execu- 
tion. The analysis in Section 4.1 quantifies the importance 
of block transfer speed in one scenario. Although the But- 
terfly Plus has a fast, asynchronous block transfer mecha- 
nism, it consumes 75% of the available local memory bus 
bandwidth on both nodes involved in the transfer. Both 
processors are memory-starved during a block transfer. Re- 
designing the memory system to allow more concurrency 
between processing and block transfers would help to re- 
duce further the effects of memory contention. 

Although the Butterfly Plus does not have data caches 
in the processor nodes, the PLATINUM coherent memory 
system is compatible with a generation of NUMA multipro- 
cessors with local caches but without internode coherency 
support. In addition to reducing latency on local memory 
operations, local data caches would reduce contention for 
the local memory module between the local processor and 
remote memory operations. Such local data caches could be 
relatively cheap because they need not incorporate a hard- 
ware cache coherency mechanism. Cache coherency would 
be maintained by the coherent memory system. Almost all 
data is cachable. Only modified Cpages that are mapped 
by remote processors cannot be cached. Replicating a mod- 
ified Cpage would, however, require flushing a write-back 
cache, slowing the invalidation operation. 


8 Related Work 


The management of NUMA memory is a topic of consider- 
able current interest. Recent studies of methods for manag- 
ing the location of data in a NUMA machine include the 
analysis and simulation of competitively optimal NUMA
memory management by Black et al. [4], Scheurich and
DuBois’ simulation of data migration in mesh-connected 
NUMA machines [25], and Holliday’s simulation of data mi- 
gration on a Butterfly [16]. The design of the Psyche mem- 
ory manager [21] contains a layer that deals with NUMA 
data location issues. 

In Bolosky’s addition of NUMA memory management 
to Mach on the IBM ACE Multiprocessor Workstation [6] 
writable pages are never replicated and are allowed to mi- 
grate only a small number of times before being frozen in 
global shared memory. While this results in performance im- 
provements compared to static placement, our experiences 
with coarse-grained sharing indicate that there is room for 
improvement. If write operations on a large piece of a data 
structure are not interleaved at a fine grain, it continues to 
be appropriate to migrate data throughout the lifetime of 
an application. Further it is not only appropriate to repli- 
cate immutable objects, but also those modifiable objects 
that are, either by accident or design, not modified during 
some phase of execution. 

While one effect of replication and migration in the PLAT- 
INUM coherent memory system is the reduction of latency, 
we contend that for large hardware configurations a far more 
important benefit is the reduction of memory and switch 
contention. Therefore, we have not expended much
effort trying to tune the mechanism for the optimal place- 
ment of frozen pages that are being actively modified at a 
fine granularity by multiple processors. While careful place- 
ment and migration can reduce average access latency in the 
absence of contention, there is no demonstrated reduction in 
contention. Since the proposed placement mechanisms are 
not cheap, entailing hardware reference counts [4, 25] or sim- 
ulations of reference counting in software [16, 21], we believe 
that it is better to have a simple, low-overhead placement 
policy and to devote more resources to reducing contention 
by reducing the amount of fine-grain write-sharing. 


9 Status and Future Directions 


Our experiences thus far indicate that the PLATINUM mem- 
ory management system will achieve its goals. Foremost, 
the memory management system makes it easier to pro- 
gram a NUMA architecture without an unacceptable sacri- 
fice in performance. Although initial programming experi- 
ments used the kernel interface directly without too much 
programmer effort, we are rapidly accumulating run-time 
libraries, shells, and other support software to further ease 
the programming process. An important part of this will 
be the installation of instrumentation for performance mon- 
itoring, analysis, and visualization [12]. The feedback from 
such instrumentation is useful to application programmers, 
compiler writers, and system implementors for NUMA
machines.

We are continuing to study the behavior of the coher- 
ent memory system under a variety of applications. Once 
the collection of applications has grown to a reasonable size 
we will systematically experiment with the implementation 
by changing parameters such as page size and replication 
policy. 

The kernel itself is designed to scale well to machines with 
a much larger number of processors. Its decentralized design 
keeps the number of remote memory accesses in the kernel 
to a minimum. We are particularly pleased with the success 
of the decentralized and concurrent implementation of the
coherency protocol, especially the low incremental cost per
shootdown and the techniques for reducing the number of 
processors involved in a shootdown. 

Although providing coherent memory transparently in 
the operating system has proven itself useful, it is not hard 
to construct scenarios in which better performance could 
be obtained if the interface between the application and the
memory management system were not so transparent. The
kernel interface will be extended to support these. While 
such information could be provided by the programmer di- 
rectly, this additional burden runs contrary to the goal of 
providing a simple programming environment. We there- 
fore anticipate that these hooks will be utilized primarily 
by programming languages and their run-time support. 

In its current incarnation, PLATINUM is a limited experi- 
mental platform for experimenting with the implementation 
of coherent memory. We will extend it as necessary to serve 
this purpose. On the other hand, dealing with issues such 
as file systems and protection is not in our plans. When and 
if it becomes appropriate to make coherent memory avail- 
able in a general-purpose operating system, we anticipate 
reintegrating those parts of PLATINUM with Mach.